Transductive Log Opinion Pool of Gaussian Process Experts
We introduce a framework for analyzing transductive combination of Gaussian
process (GP) experts, where independently trained GP experts are combined in a
way that depends on test point location, in order to scale GPs to big data. The
framework provides some theoretical justification for the generalized product
of GP experts (gPoE-GP) which was previously shown to work well in practice but
lacked a theoretical basis. Based on the proposed framework, an improvement over
gPoE-GP is introduced and empirically validated.
Comment: Accepted at NIPS 2015 Workshop on Nonparametric Methods for Large
Scale Representation Learning
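As a rough illustration of what combining independently trained GP experts at a single test point can look like, the sketch below fuses per-expert Gaussian predictions with the generalized product-of-experts rule (a weighted sum of precisions). The function name `gpoe_combine`, the example predictions, and the weights are assumptions for illustration; the abstract does not say how the test-point-dependent weights are chosen.

```python
def gpoe_combine(means, variances, weights):
    """Combine Gaussian predictions from several experts at one test point.

    Generalized product-of-experts rule: the combined precision is the
    weighted sum of expert precisions, and the combined mean is the
    weighted, precision-scaled average of the expert means.
    """
    if not (len(means) == len(variances) == len(weights)):
        raise ValueError("means, variances and weights must have equal length")
    precision = sum(w / v for w, v in zip(weights, variances))
    if precision <= 0:
        raise ValueError("combined precision must be positive")
    variance = 1.0 / precision
    mean = variance * sum(w * m / v for w, m, v in zip(weights, means, variances))
    return mean, variance


# Two hypothetical experts: the first is confident at this test point,
# the second is not. The weights are illustrative and sum to one.
mean, variance = gpoe_combine(
    means=[1.0, 3.0], variances=[0.1, 10.0], weights=[0.5, 0.5]
)
print(mean, variance)
```

The combined prediction stays close to the confident expert, which is the qualitative behaviour a precision-weighted combination is meant to give.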
Automatic Selection of t-SNE Perplexity
t-Distributed Stochastic Neighbor Embedding (t-SNE) is one of the most widely
used dimensionality reduction methods for data visualization, but it has a
perplexity hyperparameter that requires manual selection. In practice, proper
tuning of t-SNE perplexity requires users to understand the inner workings of
the method as well as to have hands-on experience. We propose a model selection
objective for t-SNE perplexity that requires negligible extra computation
beyond that of t-SNE itself. We empirically validate that the perplexity
settings found by our approach are consistent with preferences elicited from
human experts across a number of datasets. The similarities of our approach to
Bayesian information criteria (BIC) and minimum description length (MDL) are
also analyzed.
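For context on the hyperparameter being tuned: the perplexity of a probability distribution is 2 raised to its Shannon entropy in bits, so it can be read as an effective number of neighbours. The sketch below computes only that quantity for a given distribution. It is not the model selection objective proposed in the paper, and the helper name `perplexity` is an assumption.

```python
import math


def perplexity(probs):
    """Return 2 ** H(probs), where H is the Shannon entropy in bits."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2.0 ** entropy_bits


uniform = [0.25, 0.25, 0.25, 0.25]   # mass spread evenly over 4 neighbours
peaked = [0.97, 0.01, 0.01, 0.01]    # mass concentrated on one neighbour
print(perplexity(uniform))
print(perplexity(peaked))
```

A uniform distribution over four neighbours has perplexity four, while a sharply peaked one has perplexity close to one.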
Better Long-Range Dependency By Bootstrapping A Mutual Information Regularizer
In this work, we develop a novel regularizer to improve the learning of
long-range dependency of sequence data. Applied to language modelling, our
regularizer expresses the inductive bias that sequence variables should have
high mutual information even though the model might not see abundant
observations for complex long-range dependency. We show how the 'next sentence
prediction (classification)' heuristic can be derived in a principled way from
our mutual information estimation framework, and be further extended to
maximize the mutual information of sequence variables. The proposed approach
not only is effective at increasing the mutual information of segments under
the learned model but more importantly, leads to a higher likelihood on holdout
data, and improved generation quality. Code is released at
https://github.com/BorealisAI/BMI.
Comment: Camera-ready for AISTATS 202
Few-Shot Self Reminder to Overcome Catastrophic Forgetting
Deep neural networks are known to suffer the catastrophic forgetting problem,
where they tend to forget the knowledge from the previous tasks when
sequentially learning new tasks. Such failure hinders the application of
deep-learning-based vision systems in continual learning settings. In this work, we
present a simple yet surprisingly effective way of preventing catastrophic
forgetting. Our method, called Few-shot Self Reminder (FSR), regularizes the
neural net from changing its learned behaviour by performing logit matching on
selected samples kept in episodic memory from the old tasks. Surprisingly, this
simple approach requires retraining on only a small amount of data in order to
outperform previous methods in knowledge retention. We demonstrate the
superiority of our method to the previous ones in two different continual
learning settings on popular benchmarks, as well as a new continual learning
problem where tasks are designed to be more dissimilar.
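The logit-matching idea described above can be sketched as a penalty added to the new-task loss. This is a minimal sketch under stated assumptions: the squared-error form of the penalty, the `strength` parameter, and the function names are illustrative choices, since the abstract says only that logits are matched on samples kept in episodic memory.

```python
def logit_matching_penalty(current_logits, stored_logits):
    """Mean squared difference between current and stored logits.

    Each argument is a list of per-sample logit vectors (lists of floats)
    for the samples kept in episodic memory.
    """
    total, count = 0.0, 0
    for cur, old in zip(current_logits, stored_logits):
        for c, o in zip(cur, old):
            total += (c - o) ** 2
            count += 1
    return total / count if count else 0.0


def total_loss(new_task_loss, current_logits, stored_logits, strength=1.0):
    """New-task loss plus a weighted logit-matching penalty (assumed form)."""
    penalty = logit_matching_penalty(current_logits, stored_logits)
    return new_task_loss + strength * penalty


stored = [[2.0, -1.0], [0.5, 0.5]]    # logits recorded after the old task
drifted = [[1.0, 0.0], [0.5, 1.5]]    # same samples after some new-task training
print(logit_matching_penalty(stored, stored))
print(total_loss(0.3, drifted, stored, strength=2.0))
```

The penalty is zero when the network still produces the stored logits and grows as its outputs on the memory samples drift.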
Adversarial Contrastive Estimation
Learning by contrasting positive and negative samples is a general strategy
adopted by many methods. Noise contrastive estimation (NCE) for word embeddings
and translating embeddings for knowledge graphs are examples in NLP employing
this approach. In this work, we view contrastive learning as an abstraction of
all such methods and augment the negative sampler into a mixture distribution
containing an adversarially learned sampler. The resulting adaptive sampler
finds harder negative examples, which forces the main model to learn a better
representation of the data. We evaluate our proposal on learning word
embeddings, order embeddings and knowledge graph embeddings and observe both
faster convergence and improved results on multiple metrics.
Comment: Association for Computational Linguistics, 201
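The mixture negative sampler described above can be sketched as follows. This shows only the sampling step, with a fixed mixing weight and stand-in samplers; the name `sample_negative`, the `mix_weight` parameter, and the toy vocabulary are assumptions, and training the adversarial sampler is not shown.

```python
import random


def sample_negative(noise_sampler, adversarial_sampler, mix_weight, rng=random):
    """Draw one negative example from a two-component mixture.

    With probability mix_weight the fixed noise sampler is used; otherwise
    the (learned) adversarial sampler is used.
    """
    if not 0.0 <= mix_weight <= 1.0:
        raise ValueError("mix_weight must lie in [0, 1]")
    if rng.random() < mix_weight:
        return noise_sampler()
    return adversarial_sampler()


rng = random.Random(0)
vocab = ["cat", "dog", "car", "tree"]


def noise():
    # Stand-in for a fixed noise distribution (uniform over a toy vocabulary).
    return rng.choice(vocab)


def adversary():
    # Stand-in for a learned sampler that proposes a hard negative.
    return "dog"


negatives = [sample_negative(noise, adversary, 0.5, rng) for _ in range(10)]
print(negatives)
```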
Adversarial Manipulation of Deep Representations
We show that the representation of an image in a deep neural network (DNN)
can be manipulated to mimic the representations of other natural images, with only minor,
imperceptible perturbations to the original image. Previous methods for
generating adversarial images focused on image perturbations designed to
produce erroneous class labels, while we concentrate on the internal layers of
DNN representations. In this way our new class of adversarial images differs
qualitatively from others. While the adversary is perceptually similar to one
image, its internal representation appears remarkably similar to a different
image, one from a different class, bearing little if any apparent similarity to
the input; they appear generic and consistent with the space of natural images.
This phenomenon raises questions about DNN representations, as well as the
properties of natural images themselves.
Comment: Accepted as a conference paper at ICLR 201
On Variational Learning of Controllable Representations for Text without Supervision
The variational autoencoder (VAE) can learn the manifold of natural images on
certain datasets, as evidenced by meaningful interpolating or extrapolating in
the continuous latent space. However, on discrete data such as text, it is
unclear if unsupervised learning can discover a similar latent space that allows
controllable manipulation. In this work, we find that sequence VAEs trained on
text fail to properly decode when the latent codes are manipulated, because the
modified codes often land in holes or vacant regions in the aggregated
posterior latent space, where the decoding network fails to generalize. Both as
a validation of the explanation and as a fix to the problem, we propose to
constrain the posterior mean to a learned probability simplex, and perform
manipulation within this simplex. Our proposed method mitigates the latent
vacancy problem and achieves the first success in unsupervised learning of
controllable representations for text. Empirically, our method outperforms
unsupervised baselines and strong supervised approaches on text style transfer,
and is capable of performing more flexible fine-grained control over text
generation than existing methods.
Comment: ICML 2020 Camera Ready. Previous title: Unsupervised Controllable
Text Generation with Global Variation Discovery and Disentanglement
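One way to picture constraining a posterior mean to a simplex is to express it as a convex combination of a set of basis vectors. The sketch below does this with softmax coefficients; the softmax parameterization, the function names, and the toy basis are assumptions for illustration, since the abstract does not describe the exact construction.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def simplex_constrained_mean(scores, basis):
    """Map unconstrained scores to a point inside the simplex spanned by basis.

    basis is a list of K vectors (learned in a real model, fixed here);
    the result is their convex combination with softmax(scores) as weights.
    """
    coeffs = softmax(scores)
    dim = len(basis[0])
    mean = [sum(c * vec[d] for c, vec in zip(coeffs, basis)) for d in range(dim)]
    return mean, coeffs


basis = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]   # hypothetical simplex vertices
mean, coeffs = simplex_constrained_mean([2.0, 0.0, -1.0], basis)
print(mean, coeffs)
```

Because the coefficients are non-negative and sum to one, any manipulation of the scores keeps the mean inside the region spanned by the vertices.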
Implicit Manifold Learning on Generative Adversarial Networks
This paper raises an implicit manifold learning perspective on Generative
Adversarial Networks (GANs), by studying how the support of the learned
distribution, modelled as a submanifold, perfectly matches the support of the
real data distribution. We show that optimizing Jensen-Shannon divergence
forces the learned support to perfectly match the real data support, while
optimizing Wasserstein distance does not. On the other hand, by comparing the
gradients of the Jensen-Shannon divergence and of two Wasserstein distances in
their primal forms, we conjecture that one of the Wasserstein distances may
enjoy desirable properties such as reduced mode collapse. It is therefore
interesting to design new distances that inherit the best from both distances.
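As a toy illustration of why the two objectives treat mismatched supports differently, the sketch below compares the Jensen-Shannon divergence and the 1-Wasserstein distance between two point masses on a one-dimensional grid. It relies on the standard fact that the Jensen-Shannon divergence equals log 2 whenever the supports are disjoint. It is not the paper's analysis, and all function names are assumptions.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein1_1d(p, q):
    """W1 between distributions on the unit-spaced grid 0, 1, 2, ... via CDFs."""
    cdf_gap, total = 0.0, 0.0
    for x, y in zip(p, q):
        cdf_gap += x - y
        total += abs(cdf_gap)
    return total


def point_mass(position, size):
    """Distribution on a grid of the given size with all mass at one position."""
    return [1.0 if i == position else 0.0 for i in range(size)]


real = point_mass(0, 6)
for shift in (1, 3, 5):
    model = point_mass(shift, 6)
    print(shift, js_divergence(real, model), wasserstein1_1d(real, model))
```

In this toy setting the Jensen-Shannon divergence is the same for every non-overlapping placement, while the Wasserstein distance grows with how far apart the two supports are.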
Evaluating Lossy Compression Rates of Deep Generative Models
The field of deep generative modeling has succeeded in producing
astonishingly realistic-seeming images and audio, but quantitative evaluation
remains a challenge. Log-likelihood is an appealing metric due to its grounding
in statistics and information theory, but it can be challenging to estimate for
implicit generative models, and scalar-valued metrics give an incomplete
picture of a model's quality. In this work, we propose to use rate distortion
(RD) curves to evaluate and compare deep generative models. While estimating RD
curves is seemingly even more computationally demanding than log-likelihood
estimation, we show that we can approximate the entire RD curve using nearly
the same computations as were previously used to achieve a single
log-likelihood estimate. We evaluate lossy compression rates of VAEs, GANs, and
adversarial autoencoders (AAEs) on the MNIST and CIFAR10 datasets. Measuring
the entire RD curve gives a more complete picture than scalar-valued metrics,
and we arrive at a number of insights not obtainable from log-likelihoods
alone.
On Posterior Collapse and Encoder Feature Dispersion in Sequence VAEs
Variational autoencoders (VAEs) hold great potential for modelling text, as
they could in theory separate high-level semantic and syntactic properties from
local regularities of natural language. Practically, however, VAEs with
autoregressive decoders often suffer from posterior collapse, a phenomenon
where the model learns to ignore the latent variables, causing the sequence VAE
to degenerate into a language model. In this paper, we argue that posterior
collapse is in part caused by the lack of dispersion in encoder features. We
provide empirical evidence to verify this hypothesis, and propose a
straightforward fix using pooling. This simple technique effectively prevents
posterior collapse, allowing the model to achieve significantly better data
log-likelihood than standard sequence VAEs. Compared with existing work, our
proposed method is able to achieve comparable or superior performance while
being more computationally efficient.
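The pooling fix mentioned above can be pictured as replacing a last-hidden-state summary of the encoder with a summary pooled over all time steps. The abstract does not say which pooling operator is used, so the sketch below shows mean and max pooling as two common choices; the function names and the toy hidden states are assumptions.

```python
def last_state(hidden_states):
    """Summary that keeps only the final time step's hidden vector."""
    return list(hidden_states[-1])


def mean_pool(hidden_states):
    """Average each feature over all time steps."""
    steps = len(hidden_states)
    return [sum(column) / steps for column in zip(*hidden_states)]


def max_pool(hidden_states):
    """Take the per-feature maximum over all time steps."""
    return [max(column) for column in zip(*hidden_states)]


# Toy encoder outputs: 3 time steps, 2 features each.
hidden_states = [[1.0, -2.0], [3.0, 0.0], [0.5, 0.5]]
print(last_state(hidden_states))
print(mean_pool(hidden_states))
print(max_pool(hidden_states))
```

The pooled vector would then feed the layers that produce the posterior parameters, in place of the last-step vector.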